arXiv:l502.07186v2 [math.NA] 4 Apr 2016 


Parallel optimized sampling for stochastic equations 


Bogdan Opanchuk, Simon Kiesewetter and Peter D. Drummond 

Centre for Quantum and Optical Science, Swinburne University of Technology, Melbourne, Australia. 


Abstract 

Stochastic equations play an important role in computational science, due to their ability to treat a wide 
variety of complex statistical problems. However, current algorithms are strongly limited by their sampling 
variance, which scales proportionate to 1/Ns for Ns samples. In this paper, we obtain a new class of variance 
reduction methods for treating stochastic equations, called parallel optimized sampling. The objective of 
parallel optimized sampling is to reduce the sampling variance in the observables of an ensemble of stochastic 
trajectories. This is achieved through calculating a finite set of observables — typically statistical moments — 
in parallel, and minimizing the errors compared to known values. The algorithm is both numerically efficient 
and unbiased. Importantly, it does not increase the errors in higher order moments, and generally reduces 
such errors as well. The same procedure is applied both to initial ensembles and to changes in a finite 
time-step. Results of these methods show that errors in initially optimized moments can be reduced to 
the machine precision level, typically around 10~ 1 * * * * * * * * * * * * * * 16 in current hardware. For nonlinear stochastic equations, 
sampled moment errors during time-evolution are larger than this, due to error propagation effects. Even so, 
we provide evidence for error reductions of up to two orders of magnitude in a nonlinear equation example, 
for low order moments, which is a large practical benefit. The sampling variance typically scales as 1/Ns, 
but with the advantage of a very much smaller prefactor than for standard, non-optimized methods. 

Keywords: stochastic, optimization, parallel 


1. Introduction 

Stochastic differential equations (SDEs) play a universal and important role in many disciplines requiring 

quantitative modeling m- Their practical advantage is that, when used to solve large statistical problems, 

the ability to randomly sample greatly reduces the complexity of treating the full distribution function. They 

are employed for treating problems ranging from statistical physics, chemistry and engineering through to 

economics, biology and financial modeling. As a result of this large field of applications, there is substantial 

literature on the algorithms used to solve them noma, and these are generally different to methods for 

ordinary differential equations (ODEs). 

This utility is not without a price. Such numerical algorithms typically utilize many independent tra¬ 
jectories. Any statistical result will therefore have a sampling variance that scales 1/Ns for Ns random 

samples. This causes to a typical error eg that scales as 1/y/Ns, giving an error that only decreases slowly 

with the total computation time. Numerical algorithms generally aim to reduce the truncation error ep 

due to the discretization in time, giving an error of order (At) p for methods of order p. Yet reducing the 

truncation error cannot reduce the total error to less than the sampling error. Since the sampling error 

is often a large part of the total error one has to deal with, reducing the truncation error has a limited 
effectiveness. 

Here we propose a new class of reduced variance SDE algorithms called parallel optimized sampling or 
POS algorithms for short. This approach can provide extremely useful variance reduction compared to 
independent random sampling methods. The method locally eliminates sampling errors over a finite set of 
moments, up to numerical precision. Higher order moment errors are reduced also. Global sampling error 
improvement over a finite time interval is not as large as this, due to error propagation effects, but is still 


Preprint submitted to Elsevier 


April 5, 2016 




very substantial. Other variance reduction techniques that can also improve sampling errors include the 
use of low discrepancy (or quasi-random) sequences |T5] , and an extrapolated hierarchy of simulations E, 
which we do not treat here. 

To achieve our results, we introduce a modification of the standard independent stochastic trajectory 
approach. A set of Ns trajectories is solved in parallel using an optimized algorithm. This gives an 
improvement that depends on the order of the statistical moment being calculated. The present method is 
a highly practical one, since the run time is still only linear in N$- This means that the error reductions are 
of real value. If the run time scaled as TVj, and errors were reduced to 1/Ns, the reductions would not be 
so useful, since the total computational overhead at fixed error would not change. 

There is a close relationship between the present approach and methods of moment hierarchies [5] or 
cumulant expansions. These existing cumulant based methods have a major drawback. One must close 
the infinite hierarchy with an arbitrary ansatz at some finite order. If this is incorrect, the method will 
fail [18] , and several examples exist for this problem. In fact, such cumulant hierarchies cannot in general be 
truncated consistently [19]. By contrast, our method requires no ansatz; it gracefully reduces in a consistent 
way to the usual, unbiased stochastic method at large sample sizes. It is a unification of moment hierarchy 
and stochastic methods. 

We focus mostly on simple one-dimensional cases here for simplicity, except for one higher-dimensional 
example to illustrate the approach. Extensions of the POS method to cases of higher-order convergence in 
time will be treated elsewhere. However, as one example of potential applications, sampling errors can play 
a major role in quantum simulations using phase-space representations. The difficulty is that distribution 
tails may only be weakly bounded [2D]. In such cases, sampling error can determine whether a quantum 
simulation is feasible or not. One approach to solving this problem is using weighted trajectory methods EQ- 
Another method, which was successfully used in treating long-range order in the fermionic Hubble model, is 
to impose global conservation laws [22 . The present approach has a more general applicability than either 
of these, but can also be combined with such earlier techniques. 

The organization of the paper is as follows. In Section [2] we analyze the stochastic errors that can arise 
in computational methods for stochastic differential equations. In Section [3] sampling errors are reduced 
in initial observables by introducing a new type of stochastic noise generator. Numerical examples of this 
are treated in Section ^ showing that variance in selected moments can be reduced to machine precision. 
Next, in Section [5j this approach is extended to dynamical optimization of stochastic equation algorithms, 
with numerical examples given in the remaining sections. In Section [6] we start by treating single-step 
optimization, which ignores error-propagation. In Section [7J we demonstrate the global, or multi-step 
performance of these methods for a number of one-dimensional linear and nonlinear cases, showing that global 
sampling error variance is substantially reduced compared to traditional independent trajectory approaches, 
typically by at least an order of magnitude. Finally, Section [8] treats a two-dimensional example which has 
very similar behaviour. Section [9] gives our summary and conclusions. 

2. Stochastic integration error 

Stochastic integration involves both truncation errors due to finite step-size, and errors due to finite 
samples. In this section we analyze the performance of independent SDE methods by considering the overall 
resources needed to solve the equations for some given error. To explain the background to our approach, 
we show that increasing the convergence order is a less effective way to reduce errors than one might expect, 
because of a trade-off between the resources needed to reduce both truncation and sampling errors. As 
large ensembles are required to reduce sampling errors, we find that higher-order methods for SDEs, while 
undoubtedly useful, are not as effective as in ODE integration. 

In the rest of this paper, we show how to overcome this problem by using parallel algorithms that reduce 
sampling errors. 

2.1. Distributions and observables 

We start by establishing our notation. The observables of a probability distribution function P (t, x) are 
the fundamental objects of interest in stochastic calculations. These are probability-weighted integrals of 
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functions o m (x) over x: 


(Om) — 


dxo m (x) P (t, x) . 


(1) 


In general, dx is a Euclidean measure over a d-dimensional real or complex space of variables x. The 
numerical examples given here will focus mainly on moments of one-dimensional distributions with o m (x ) = 
x m , giving a vector whose error should be minimized: 

o = {oi,... ,o M } • (2) 

With sampling methods, the time-evolution is solved by generating a number Ns of independent sample 
paths x^ n \t). This procedure is equivalent to approximating the true distribution by the sampled distribu¬ 
tion defined as 

n s 

Ps (t, x) = — ^2 6 ( x - x (n) (t)) ■ (3) 

S 71=1 

We note that the initial value problem consists of defining an initial static distribution -P(x) at t = to, and 
then sampling it with initial points x( m ) (to)- For dynamics, the sampled paths are defined at Nt discrete 
times ti , with a spacing At, such that exact ensemble average values are recovered in the double limit of 
infinite sample and zero step-size: 


. N S 

(o m (iO> = A lim ^ W (x (n) (ti)) - (4) 

n—1 

These paths are obtained by numerical algorithms which propagate x^ n \ti) to x^ n ) (t,;+i), with a trun¬ 
cation error vanishing as At —> 0. If we call Ns the sample number, the phase-space is made up of a set of 
distinct samples or trajectories. These form an extended vector of parallel trajectories, 

X=(x« T ••• x^) T ) T , (5) 

of size Nsd. It is convenient to define the sampled observables as a function 6 of the extended vector of 
sampled trajectories, 

1 N S 

dm (X) = — ^ ° m ' ( 6 ) 

n—1 

Throughout this paper, we will use the standard notation that all extended vectors or matrices involving 
a dimension of the order of the sample size Ng are written in upper case like X, while objects that do not 
depend on the sample size are written in lower case. 


2.2. Fokker-Planck and stochastic equations 

To explain the problem of interest here, we give a detailed analysis of a stochastic differential equation 
or SDE. The underlying distribution function P(t,x) satisfies the Fokker-Planck equation [B], 


dP (■ t , x) 
dt 




d 2 


dxjdxj 


d ij(x) 


P(t,x) 


( 7 ) 


We wish to minimize the errors for solving the corresponding Ito SDE mm, given in a standard form by: 

dx = a (x) dt + b (x) dw, (8) 

where d = bb T , and the Gaussian distributed real noise dw has correlations given by: 


(dwidwj)^ = Sijdt 
( dw i)oo = °- 
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( 9 ) 






Here the (...) notation means ensemble average, so that 
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Ns 


N s 

E< 

n= 1 


.(«) 


( 10 ) 


with (... ) 00 being the infinite ensemble limit. 

2.3. Sampling and error criteria 

To evaluate the computational error, we must define an error criterion relevant to the entire sample 
ensemble X. In practical terms, this means one must calculate errors in each required observable, rather 
than just the errors in a single trajectory. By minimizing the total error over a finite number of samples, 
one should be able to arrive at the most efficient algorithm that utilizes a given computational resource. 

To quantify the error we will use a weighted norm ||P||^ = \J ( P , P) > 0. The integration error of a 
sampled, calculated stochastic distribution Ps relative to the exact distribution Pg is then defined as 

e=\\Ps-P E \\ w , (11) 

which depends on the choice of weighted norm. 

The computed averages, (o TO ) Pg , are obtained through sampling, which means that typically one is 
interested in calculating specific observables with minimum error. To evaluate the computational accuracy 
relevant to these observables of interest, we choose a particular set of observable quantities, o m , for m = 
1,... M, to define the distribution norm, so that: 

M 

\\ P \\w= E W m \(Om) P \ 2 . (12) 

m— 1 

Here W m is the relative weight assigned to observable m. We assume the error is evaluated at a fixed time 
t, otherwise one may wish to minimize the errors at each of a set of sample times. In most of this paper, the 
observables are a finite set of one-dimensional moments, o m = x m , although other measures are certainly 
possible. For simplicity, we choose W m = 1 from now on. 

With this definition, the total error is 


M 

e= \ El I (°m) p Q ~ ( °m ) p E I ! (13) 

\ m =1 

The integration error e clearly must be calculated from the entire vector of stochastic trajectories x. The 
task of defining an optimal stochastic integration method is to obtain x ( t ) with a procedure that minimizes 
the total error e relative to some computational resource. Using the definition that 


l^m 


(°m)p E > 


(14) 


we see that: 


e 


M 


N 


X |Om (X) 


fJ"m 


(15) 


Although different to the usual probability norms, we will use the above definition throughout this 
paper, since it corresponds to the operational requirements of a norm for a sampled distribution. For a 
single observable, say x, e 2 is simply the usual sample variance in x — provided there are no other errors 
present as well. However, there are usually other errors, including errors due to arithmetic roundoff and the 
finite step-size of the integration algorithm. 
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2.^. Optimizing the sample number 

What is the optimal strategy to solve this SDE, given fixed total computational resources and a goal of 
minimizing the total error? With independent paths, the computational resource utilized is proportional to 
N = NsNt, where Ns is the number of samples and Nt is the number of time-steps. One can invest total 
processing time either in reducing the time-step, or in increasing the number of samples, but reducing the 
step-size means using less samples, and vice-versa. 

To understand these issues quantitatively, we first consider the optimal strategy for error-reduction with 
traditional methods that involve using independent sample paths. We will show that it is only important 
to reduce the step-size or discretization error to the point where it is some fraction of the sampling error. 

Consider the total error for Nt time-steps with a stochastic integration method of global order p, that 
utilizes Ns independent sample paths. From the central limit theorem, the global sampling error eg due to 
the use of a finite sample, is 

e s = a Ng 1/2 , (16) 

where cr is the standard deviation of the measured quantity, and the same type of scaling holds for sums 
over errors in multiple observables. 

There are additional errors et over a fixed time interval T = N T At, due to the finite time-step At. If the 
local error of one time-step is £At, then at best er ~ NrEAt- The discretization order p is therefore defined 
so that the global time-step truncation error et scales as 

e T = cNi^ p oc A t p , (17) 

where c is an algorithm-dependent constant that depends on the set of computed observables. 

Since the two error sources are independent, the overall error is given approximately by a sum of two 
terms: 

e = et + es = cNt p + cNg ^ . (18) 

We wish to constrain the total processor time — parallel or otherwise — so that NsNt = N is bounded. 
The total CPU time is then Tcpu = Nt, if one step in time takes a real time duration r, and other overheads 
are negligible. 

Next, consider how to optimize the tradeoff between the truncation and sampling errors. Letting c = 
cN~ p , and keeping N fixed, one obtains: 

e = cN p + aN~ 1/2 (19) 

This is minimized by choosing a combination of time-step and sample number such that: 


The total error is then: 


N s 

2 p 

= TV 2 p+ 1 

N T 

i 

= TV 2P+ 1 


' ca 2p ' 

1 

2p+l 

€ = 

2 pN p _ 



a 

2 pc 
a 

2 pc 


2 

2p+l 


2 

2p+l 



( 20 ) 


( 21 ) 


2.5. Effective integration order 

From the analysis above, the best that one can do to reduce errors is to obtain a scaling of e oc lV -Peff , 
where the effective order p e ff is 


Peff = 


P 

2p+l 


( 22 ) 
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This is because a higher order integration technique has no effect on the sampling error, which eventually 
dominates the error calculation. The best effective order, even with p = oo, is p e g = 1/2. Given that 
stochastic higher order algorithms are complex and slow, higher p values are not always an advantage. 

This optimal effective order requires the use of an optimal ratio of samples to steps, which is: 


Ns an-* 
—— = N 2 p + 1 
Nj 1 


2 pc 


(23) 


This ratio is difficult to calculate a-priori, since these constants are not usually known in advance. One 
can estimate the sampling error numerically by measuring the sample standard deviation a, which is known 
from the statistics of the simulation. The truncation error constant c is also measurable by changing the 
step-size. Thus, both c and a can be measured in software. An optimum point is reached when the ratio 
between discretization and sampling error is given by the extremely simple result that: 


€t 1 

e s 2 p ‘ 


(24) 


In summary, it is only productive to reduce the step-size to the point where the discretization error of 
1/2 p of the sampling error. After this point is reached, one is better off to use more samples. Another way 
to describe this is that the least computational cost for error e is 

N oc e~ 1/p ‘ a = e - (2p+1)/p . (25) 


There are multi-scale methods HU that can achieve a cost of N oc e 2 (loge) 2 , which is a useful further 
improvement. 

In summary, from the traditional analyses of ODE integration, one might expect that the error would 
be reduced extremely rapidly with a higher order method. This is not the case with an SDE. Because the 
independent sampling error varies slowly with resource, this becomes the dominant error with increasing 
resources. 

Our conclusion is that sampling error strongly limits the effectiveness of independent sampling methods 
for SDEs. 


3. Initial conditions for parallel optimized sampling 

We now describe a family of variance reduction methods using parallel optimized sampling (POS), which 
locally eliminate sampling errors for a finite set of observables o, thus improving computational efficiency. 
As described above, for N$ samples, there is a Ag d-dimensional extended vector of stochastic variables, 
X = ( x« T ••• -ff N s) T which is optimized so that X — > X^ opt ^ . Both initial samples and the 
resulting stochastic trajectories are optimized in parallel, thus making use of increasingly parallel hardware 
capabilities. 

Before deriving the POS algorithm, we need to address an issue arising from the known initial condition 
for the SDE, with x having a distribution P (x) at t = to. When we derive the optimization conditions, 
one of the assumptions will be that the moments of x are already optimized at the start of each time 
step, that is ( o m (i))jv s = (f) for m = 1... M. At every step of the SDE integration, if we obtain the 

optimized stochastic vector from the previous step, then, since we optimize the differentials of the moments, 
the resulting stochastic variables will be optimized as well as possible, apart from any time-step errors, which 
is a separate issue. 

But how do we ensure the initial set of moments of the distribution P (x) is optimized at the first 
integration step? 
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3.1. Static POS 

We emphasize that our main goal is to minimize the variance for the dynamical stochastic equations, 
treated next. As a consequence, we do not use methods like Gaussian quadratures for the static variance 
reduction problem. These are limited in the diversity of initial distributions treated, and they cannot be 
easily generalized to the dynamic case. Instead we take a set of random samples from P (x), and optimize 
these initial sampled variables to reduce the variance in moments of interest. 

We call this static POS, as it does not involve dynamical time-evolution. In other words, we wish to 
sample the initial values for the trajectories so that M chosen observables are equal to their known 
exact values, that is, 

b (X) = /x = (o)oo, (26) 

where /x is an M -dimensional vector. That is, we wish to set ep = ep(to) = 0, where ep is the total error © 
in the initial distribution, using an observable-based error measure. In the examples given in this paper, 
we treat all moments up to a given order. More generally, at high dimension d, treating all correlations or 
moments to a fixed order is exponentially hard. Therefore, it is necessary to select moments according some 
importance criterion, the effects of which will be treated in a later publication. 

This is a classic example of nonlinear, multidimensional root-finding, with optimization over the di¬ 
mensionality of the full space of samples. It can be a challenge to solve these equations efficiently, with 
resources no greater than O(Ns), and with minimal changes to the original sampled estimate for X(fo), 
which is labelled X( 0 ). However, in examples treated here, we find that the numerical problem can be very 
accurately solved, naturally with machine-limited numerical precision. Here we use an iterative, modified 
Newton-Raphson method for the solution. Other techniques are available as well. 

In an iterative Newton-Raphson approach, we use stochastic methods for the first estimate, X( 0 ). At 
each iteration, we set X —»• X + 5X to obtain the next estimate X( i+1 ), which has a better fit to the required 
moments, so that X( i+1 ) = X(.q + 5X(j+i), where ||dX(j +1 )|| -C ||X(q||, and || ... || denotes the Euclidean 
norm. 

Expanding to first order in i'X| t+1 v the conditions we require are: 


= o(X (i) )+J(X w )5X (i+1) +0(^. +1) ), (27) 

where J is a Jacobian of 6: 

j ” m(x)= ^yp- (28) 

In matrix form, the iteration equation is therefore 

J(i)<5X(i+i) = /x - b(j). (29) 

where we use the obvious notation that 6(q = 6 (Xpj) and J(q = J (X^). We note that out method is 
scale-invariant in the sense that it will result in the exact same solutions if we rescale any observable and 
its exact value as o,(X) —> CiOi(X), where Ci is a constant. 

Each step of the iterative solution is underdetermined, since the relevant matrices are not square, and 
hence not invertible. Such problems are relatively common in optimization, and here we analyze two different 
techniques to solve them. 

3.2. One-dimensional example of iteration equations 

We will use as an illustrative example the situation when the observables are moments of a one¬ 
dimensional stochastic equation. In this example o m = x m , so the conditions are: 


X(°pt) 


NsHm 

x w • 1 + « x w _1 • <5 X b + D + O (^ X (, + i)) . 
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( 30 ) 





where the powers of matrices and vectors are understood in the element-wise (Hadamard) sense (e.g., 
X 2 = X o X). Hence, in this case the observable vector and matrix of derivatives are: 


6(X) 

J(X) 


A n ) 


N S 

— y 

N S y [ x (n)]M 


1 

ivs 


M[xW ] M ~ 1 ••• 


M—l 


(31) 


3.3. Static iteration solution with a matrix pseudo-inverse 


The simplest method to solve the iteration equation (291 is to use a matrix pseudo-in verse. This has 
another advantage, since the pseudo-inverse has the property that it generates a least-squares solution, with 
minimal changes, as required for this algorithm. 

Thus, we have the iterative procedure 


x (i+i) — X (i) + (/x - o (i) ) . (32) 

where J + indicates the pseudo-inverse of the M x N$ Jacobian J. 

This pseudo-inverse method is simple, general and effective. It satisfies our requirements given above. 
However, standard pseudo-inverse software using singular value decomposition (SVD) methods has draw¬ 
backs. It is relatively slow, can have stability problems and may not be available on all programming 
platforms. 


3-4■ Improved matrix inversion iterations 

To improve efficiency and overcome problems with the use of general purpose pseudo-inverses, there is a 
faster alternative method for solving (291. 

The pseudo-inverse is a limit. If J ' denotes the conjugate transpose of the matrix J, then: 


J + = lim 

ot—> 0 


JJ f +al 


(33) 


Hence, provided u = J J J is invertible, one can directly obtain that J + = J^u -1 . It is important to note here 
that u is only an M x M matrix, regardless of how many samples were used originally. 

This gives as many unknown coefficients as there are equations. As a result, we only need solve M linear 
equations in M unknowns, for which there are efficient techniques, resulting in the iteration procedure 


X (i+i) - x (i) + J (q u (i) 1 - °(»)) • ( 34 ) 

Here the matrix u being inverted is anAfxM positive-semidefinite matrix. We use the lower case for it 
to indicate that it does not take an extended dimensionality. It is generally invertible and relatively small 
compared to the sample size, resulting in a highly efficient optimization algorithm. Numerical investigations 
show that this simple method is convergent for large sample number Ns, which corresponds to cases where 
the initial sample has moments close to their ideal values. 

As the stopping condition in the numerical examples given later, we choose the condition that the changes 
are small relative to the modulus of the current X value, i.e. ||JX( *+i)ll/l|X(i)|| < V = 10 8 , where ry 2 is the 
numerical machine precision, as previously. An iteration limit of / max = 50 was used, although as we see 
later, the typical iteration number is much lower than this. 

We note that this simplified method is only useful when u is invertible, otherwise the full SVD techniques 
or other more robust algorithms should be used. This is not a major limitation: as long as the first derivatives 
of our observables are linearly independent, u will be invertible. As often the case with such numerical 
methods, performance can be further improved through linear scaling or translation operations. 




However, sometimes the Newton-Raphson method doesn’t converge, not because u lacks invertibility 
(which is rare), but because the initial guess is too far away from the solution. This can be detected by 
checking if the norm of A increases. In such cases, one can back-track and try again with a different initial 
estimate. In the case of time-evolution dynamics, treated next, a smaller time-step can also be used to 
ensure the initial guess is close to hte solution. 


3.5. One-dimensional case: power series method 

As an equivalent way to understand our approach in the one-dimensional case, we can expand <5X, which 
is the iterative change in X, using the derivative matrix and a vector of changes in observables, do. With 
observables as moments, this is a series in X^ up to order M — 1, with M coefficients written as a vector d. 
The expansion is given by: 


M 


6X. = jt<So = m ( xt ) m 1 65 


N s 


m—1 


Defining a square, positive definite matrix: 


u (i) — J(*)J(j)> 


the basic iteration equation (291 becomes: 


(35) 


(36) 


U(») do(j-|-i) = p - o (i) . 


(37) 


Since the solution for d(j +1 ) requires the inversion of u ( q, this leads to the same final algorithm that is 
defined above, in Eq (341. 


4. Numerical examples: Static optimization 


In this section, we perform numerical optimizations of initial conditions, following the methods of the 
previous section. This is important because the initial sampling error can dominate the sampling error found 
throughout the calculation, if it is not optimized. For definiteness, in most of the numerical examples of 
POS algorithms analyzed here, the m-th observable is a finite moment of a one-dimensional real distribution. 
Hence, 


1 N $ r 

^ ^ N S = ^ 


71=1 


N s 


(38) 


where 1 is an TVg-dimensional unit column vector. We optimize moments, for m = 1 ,...M. As well as 
calculating the resulting error in these optimized moments, we also obtain the errors in other moments. An 
important issue is that the non-optimized moments should not have increased errors. We will also consider 
the ensuing results for two less regular observables: o eX p(a^) = exp(:r) (a combination of an infinite number 
of moments) and 0 ||(a;) = |a;| (an irregular function). 


4-1. Static optimization 

In this section we will test the performance of the iterative algorithm from Section [3] applied to the most 
common initial condition — a normal distribution. However, we emphasize that this choice is purely for 
convenience, and any initial distribution could be used. In this section we will use p = p\ = 0 for all the 
tests and set the variance a 2 = p 2 = 1. This helps to avoid precision loss when calculating central moments, 
which is important, since the method is extremely efficient and optimizes the moments almost to the limit of 
numerical precision, as seen below. If maximum precision is required for cases when p ^ 0, one can subtract 
the mean of the initial distribution so that p = 0 in the new variables. 


9 








log 10 (-R) 

Figure 1: Distribution of normalized errors in the static optimization tests for the matrix inversion method with <r = 1, 
Ng = 1000 (solid blue lines). The grey dotted curves show the normalized errors of a randomly sampled vector. The dash- 
dotted line denotes the numerical precision limit (e n um ~ 2.22 x 10 —16 for double precision floating point numbers). Zero error 
results are removed from binning for ease of plotting. 


With (i = 0, the moments of the test distribution are given by 


Mm 


0, m is odd, 

<j m (m — 1)!!, m is even. 


(39) 


The number of moments being optimized here is M = 6, but this is not essential. For the additional 
observables the expected values are the expectation of the log-normal distribution 


/^exp = exp 



and the expectation of the folded normal distribution 


(40) 


M|| =<r 



(41) 


We test the behavior of the algorithms in 10000 independent optimization attempts, each with Ns = 1000 
parallel samples, in order to collect statistics. In each attempt, a TVs-dimensional vector X is generated using 
random sampling and the matrix inversion iterative method from Section [3.4| is applied to it. We then plot 
the distribution of measures of interest, which indicate the performance of the method. This demonstrates 
how well the algorithm performs when tested over a range of different initial random samples, each drawn 
from the same underlying normal distribution. 


4-2. Distribution of the optimized variance reduction 

There are several quantities we are interested in. We first investigate the distribution of the optimized 
variance. This gives an idea of not just how much the variance is reduced, but also the probability of a given 
amount of variance reduction. 

In more detail, we plot the numerically obtained distribution of final central moments of the vector 
compared to the desired values: 


Rm (X) 


jy °(X) • 1 Hm ) 


(42) 
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Figure 2: (a) Number of iterations taken by the matrix inversion method, a = 1, Ng = 1000. (b) Relative distance D between 
the optimized vector and the initial randomly sampled vector in the static optimization tests for the matrix inversion method, 
<j=l, Ng = 1000. (c) Average relative distance ( D) depending on the number of trajectories Ng in the static optimization 
tests for the matrix inversion method, <7 = 1. The grey dashed line denotes the slope of a l/y/Ng dependence. 


where we will compare this quantity for initial randomly sampled vectors and optimized vectors. It is 
convenient to scale this to the expectation of the error in the sampled moment (for random samples) for the 
normal distribution: 


R 


m 


R 


m 


a m \Jm\/Ns 


(43) 


so the average of R m for the initial randomly sampled vectors will be close to 1 for all m. Fig. [T] shows this 
quantity both for the moments we are optimizing (m = 1... M) and for two higher moments that are not 
optimized. The errors for the additional observables are normalized simply as 


R 


exp, 11 


R‘exp.[ | 


(44) 


The optimization procedure was able to reduce errors almost to the limit of numerical precision of 
R m ~ 10” 15 in many cases. The moments that were not directly optimized show some improvement as well, 
but typically closer to a single order of magnitude. Since we use logarithmic graphs, and R m can sometimes 
be zero, we excluded these zero results from the binning. 


4-3. Number of iterations and optimization distance 

It is also useful to know the number of iterations performed in each of the algorithms. This gives an idea 
of the numerical cost, which we investigate in greater detail below. If too many iterations are needed, the 
time taken will increase to the point that one may as well use more traditional methods. 

Fig. § a) demonstrates that only a small number of iterations — typically four — are necessary for the 
method to converge, given a reasonably large ensemble. The third quantity of interest is the relative distance 
of the optimized vector from the initial randomly sampled one: 


D = 


||X( opt ) — X|| 


(45) 


We would like this to be much less than 1, which means that the optimized vector is still approximately 
“random”. Fig. [2jb) shows that it is indeed the case, and the average deviation is D ss 5 x 10” 3 for a = 1. 
Fi g- Hr) shows that the results have the desirable property that the relative distance between the optimized 
and non-optimized trajectories scales with 1 / y/Ns . This is expected from the fact that the initial sampled 
moment relative errors scale as 1/y/Ns, hence one expects the minimum required change to be of this order. 

However, if there are too few trajectories, the simple matrix inversion iteration method diverges, pro¬ 
ducing unacceptably large changes D in the distribution. For these small Ns values, the iteration limit of 
Im&x = 50 was reached, indicating a lack of convergence. 

To investigate this more carefully, the dependence of the final error R on the number of trajectories Ns, 
is plotted in Fig. [3] The results also display a sharp cut-off in the number of trajectories below which the 
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Figure 3: Dependence of normalized errors on the number of trajectories Ng in the static optimization tests for the expansion 
method, a = 1. The dotted curve shows the dependence for randomly sampled vectors. 
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Figure 4: Dependence of the cost on the number of trajectories Ng in the static optimization tests for the expansion method, 
(7 = 1. The dotted curve shows the dependence for randomly sampled vectors. 


iterative approach does not converge. More robust minimization algorithms could presumably solve this, 
but the issue was not investigated here for space reasons. 

4-4- Initial optimization cost 

For practical applications, the time taken for the optimization is also important. More accurately, the 
quantity of interest is the efficiency [231, which in our case can be expressed as Eff =1/ ^TcpuR, 2 )i where 

Tcpu is the time required to generate the vector X( opt i. 

Since in our case in a significant fraction of optimization attempts R for optimized moments is exactly 
zero, we plot the cost 


Cost = gg = TcpuR 2 - (46) 

averaged over attempts. Fig. [4] shows this quantity for all investigated observables. For the optimized 
moments (to = 1... 6) the cost of POS is significantly lower than the cost of the purely random sam- 
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pling (starting from a certain critical number of trajectories). The situation is different for non-optimized 
observables, where POS still demonstrates a decreased cost, but the difference is not as overwhelming. 

In conclusion, the expansion iterative method performs well for a wide range of parameters, while being 
fast and easy to implement. Convergence is extremely rapid when the number of stochastic trajectories is 
increased above a critical value. However, we note that many other methods of nonlinear root-finding exist, 
and we do not exclude the possibility of better techniques being available. 

We use this approach in the SDE examples given below, to pre-optimize the starting conditions for the 
SDE integration tests. 


5. Parallel dynamic optimization 

The general approach here is to assume that samples with an optimized set of observables are known at 
a given time t. Given this, a noise ensemble is chosen to give the lowest possible errors in the observables 
at a later time t + At. Overall convergence is then achieved by taking a large Ns limit. The goal of the 
algorithm is that the observable moments are obtained with minimal error. For efficiency, the resulting 
calculation complexity should be linear in Ns- 

In summary, the POS algorithm must therefore meet three essential requirements: 

1. Equations for a finite set of M observables are optimized both initially, and for local changes in time 
to a given order in At. 

2. Non-optimized observables still have their correct values to the same order in At in the limit of 
Ns —> oo. 

3. The computational cost of the algorithm should be no worse than O(Ns). 

If the observables chosen are moments, then, since higher order moments depend on lower order ones, these 
will experience a degree of optimization as well, even though not directly optimized. POS methods can also 
be combined with higher-order time-step methods, but here we focus on sampling error reduction for ease 
of explanation. Other types of observable also experience a degree of variance reduction as well. Typically 
this is not as great as experienced by the optimized moments, but this depends on the observable and the 
stochastic equation. 

There is an analogy between this approach and the method of moment hierarchies in statistical physics [5j. 
The difference is that rather than an arbitrary truncation of the moment hierarchy, the higher order moments 
are estimated in an unbiased way via sampling. 

We now describe how to extend this initial optimization algorithm to treat dynamical optimization of 
moments during stochastic time evolution of x. This involves stochastic noise terms and deterministic drift 
terms. 

5.1. Euler-Maruyama algorithm 

For simplicity, we only treat the optimization of the Euler-Maruyama integration scheme for an SDE in 
Ito form US], which has truncation order p = 1 for convergence of moments. This is a discrete expression of 
the standard Ito SDE, so that for a finite step in time At, 

Ax = aA t + bAw. (47) 

Next, we introduce A = ( a r (x^) • • • a 2 (x^s)) ) T as an extended vector of drift coefficients, 
B = diag ( b (xW) • • • b (x INs ' 1 ) ) (a block-diagonal matrix with b (x^) elements) as an extended 

matrix of noise terms, and AW = ^ (Aw(^) T • • • (Aw^ s )) T ^ as an extended vector of noises. The 

parallel sample vector X will satisfy the following vector SDE, now of dimension Ngd: 

AX = AA t + BAW. (48) 
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One requires that the sampled observables are given as closely as possible by the known infinite ensemble 
results, that is, 


6 (X + AX) - 6 (X) = (A Hx)))^ + 0(At 2 ), (49) 

where (A (o(x))) is the ideal observable change for an infinite ensemble to order At. This is given either 
by an application of a single-step approximation to the Fokkcr-Planck equation |7]) followed by partial 
integration, or equivalently by an application of Ito’s formula (6, to the stochastic equation ([8]): 


(A O m (x))^ 



1 d 2 o n 

O-i + — - 


2 dxidxj 


At + O (At 2 ) 


(50) 


In order to be able to calculate these moments in practice we must make another approximation, since 
we only know the required averages for a finite ensemble of trajectories. From the central limit theorem, if 
the individual variances are finite then: 


da„ 

dx, 


1 d 2 o, 

~ a i + X 


2 dxidxj ^ 


don 
dx , 


1 d 2 Or, 

~ a i + X ' 


2 dxidxj 



(51) 


In some cases the equality is exact; see the discussion of the error introduced by this approximation in 
Section 15.31 


5.2. One-dimensional case 

To illustrate how the method works for a single variable, we note that in the one-dimensional moment- 
based case, this result reduces to: 

(A (a: m )) 00 » m(x m ~ 1 a+ -x^b 2 ) At. (52) 

\ 2 /n s 

These methods can also be extended to higher orders in the step-size, but here we treat the equations up 
to the first order in At, and truncate higher orders for simplicity. Higher order algorithms will be treated 
elsewhere. 

At this stage, we point out the existence of more than one possible strategy for satisfying the moment 
equations. Since each has its own distinct advantages and disadvantages, two particular approaches will be 
treated here. They are described in the following two subsections. 

Both of the resulting POS methods have the following useful features: 

• Uniqueness — the matrix iteration equations have unique results. 

• Parameter independence — only the observables that are optimized are specified, not an arbitrary 
parameter. 

• Linear complexity — the time taken scales as Ns, since the inner products only require Ns operations. 

5.3. Error propagation 

With any dynamical POS algorithm, the goal of the optimization is to remove any difference between 
the sampled observables and the infinite trajectory observables. This can be achieved exactly for static 
optimization, but there are error propagation effects to be considered in the dynamical case. Before treating 
the step-wise optimization method, we briefly consider the potential effect of error propagation. 

To understand this, we note that error propagation effects depend on both the choice of observables 
and the structure of the equations themselves. Ideally, the optimized set of observables form a closed set of 
equations, but this is rarely the case in practical nonlinear calculations. 
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We illustrate the effects of error propagation by considering the one-dimensional moment-based equations. 
Initially, the error in the p- th moment is: 


{x p + Ax p {q) ) Ns - ( x p + Ax p ) c 


7^ 0 , 


(53) 


which varies as l/y/Ns for random noises. The aim of our optimization is to set the difference to zero, apart 
from time-step errors. Can this be achieved exactly, assuming that the stochastic variables themselves are 
initially optimized at the start of the simulation, that is, ( x p )n s = (x p ) 00 for p = 1, ... Ml 

To answer this, note that the drift and the diffusion terms can be expanded in the powers of x as 


a{x,t) = b 2 (x,t) =^ j g j (t)x : >. (54) 

j—0 j -o 

After substituting the expansions into the condition expressions, and expanding to order (At), 


= P 


(a?- 1 Aw + ?~y^x p - 2 (Aw 2 - b 2 At)^) 


N s 


+ E ~(x P+J 1 )oo)At 

j-M—p-\- 2 

1 oo 

E 9j(t) ((x p+j - 2 ) Ns - (^- 2 )oo) At 

j=M—p +3 


(55) 


The last two terms depend on unoptimized and thus unknown differences (x j )n s — ( x ■ , ) 00 for j > M. In 
deriving the POS algorithm, we assume that all the moments of the sampled distribution are correct at the 
start of the step in time, which allows us to neglect these higher-order moment differences. 

One possibility is that the coefficients / and g for the corresponding indices are in fact zero. This is 
equivalent to having a at most linear in x, and b 2 at most quadratic in x. If this condition is satisfied, then 
higher order moments remain uncoupled to lower order moments, and no error propagation occurs apart 
from the usual error propagation in an ODE method. 

If this condition is not satisfied, neglecting these terms will result in unoptimized high order moments 
“leaking in” to low order moments over time, with the effect more noticeable for orders closer to M. We 
ignore this effect in deriving the algorithm, which means that we can only locally remove sampling errors 
exactly, as we show later. As a result, higher-order moment errors gradually degrade the optimized moments. 
This results in globally increased errors for nonlinear SDE solutions, which is demonstrated in the numerical 
examples described in the next section. The consequence is that it is no longer possible to reach machine 
precision. However, substantial variance reduction is still possible. 


5.4. Combined optimization 

In combined optimization, all time orders in the sampled moments are combined together, to give a 
moment estimate in an analogous form to that treated previously for the initial conditions. The advantage 
of this approach is its ease of implementation and simplicity. 

However, unlike the case of the initial distribution, we typically do not know the infinite ensemble average 
moments exactly, that is, to all orders in At. Hence, in this approach the generated probability distribution 
differs from the stochastic distribution by terms of order At 2 , even in the infinite ensemble limit. 

Hence, we use the parallel SDE to provide the initial estimate for the stochastic trajectory, prior to error 
optimization, of: 

X (0) (t + At) = X (t) + A (t) At + B (t) AW. (56) 
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Obtaining the improved trajectory estimates is then essentially identical to the static requirement given 
already. We can write the moment requirement in the form (261: 


o (X (t + At)) = c, 


(57) 


where c is a vector of ideal observables, and X (t + At) = X( 0 ) ( t + At) + <5X. Since we do not know these 
“ideal” averages over infinite number of trajectories, we approximate c by assuming we know all the moments 
exactly at the start of the time interval based on knowing X(£). Changes in moments are estimated up to 


terms of order At using the approximations (501 and (511 so that, in general: 


Cm — 


do m 

dxi 


1 d 2 c 


2 dxidxj 


At) 

/ N s 


(58) 


In the one-dimensional example case, from Eq. (52), 


Cm 



+ Til 


x m ~ x a 



x m ~ 2 b 2 



At the i-th step in time, we assume that Xp +1 ) ( t + At) = X(*) (t + At) + <5X(j +1 ), so that 

6 (X (i+1) (t + At)) = 6 (X (<) (t + At)) + J (i )<JX (i+1) + 0(5X 2 i+1) ). 

5.5. Iterative solution to the combined equations 
The iterative equations to be solved are: 


(59) 


(60) 


J ( o<SX 


(*+i) 


J w- 


(61) 


Here we define the matrices J(q as in Eq. (31). The moment equation is then solved iteratively using the 
techniques outlined above, that is, either by pseudo-inverse iterations, or more efficiently by the matrix 
inversion method. With pseudo-inverses, the iteration equations are then simply: 


X 0+1) - X 0) + J (i) ( c _ °w) ■ 

For cases where u = JJ is invertible, we can define, just as in the static case: 

x (i+i) = x (q + J (i) u (i) ( c _ 6 «) • 
Numerical results using this method are given in the next Section. 


(62) 

(63) 


5.6. Individual optimization 

In the individual optimization approach, both the finite ensemble moments and the infinite ensemble 
moments are calculated to the same order in At, which allows us to optimize each order in time individually. 
This strategy permits a clear separation of ensemble and time-step errors. For this purpose, it is convenient 
to define an effective noise term, AV = BAW. This is optimized to give the final change in X. 

From the moment equations of Eq. ( [49] ) and ( |50[ , one has the required observable changes to order At: 

r)n I r ) 2 n 

E d£; AVn ( x ) + 2 E Qxjx [AVn (x) (x) - AtDnp (x)] == ° ( A<2 ) ( 64 ) 

n n n.p n P 


Here, 
to define 
extended 


since we wish to set every moment error to zero over a finite set of moments, it is co nven ient 
an error vector, e = {e\, ... cm) T ■ For example, in the one-dimensional case, from Eq. (491 the 
diffusion matrix is simply diagonal, and one has that: 


N s 


E 




m — 1 


AK + -^X™~ 2 (AV 2 - D„„A t) 


J N s 
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O (At 2 ) , 


(65) 




















The error terms can be viewed as two distinct “mean” and “variance” conditions analogous to those in 
Eq. ([9]) , and are equal to zero in the limit of an infinite ensemble. The error requirement is satisfied provided 
two individual conditions are met, representing terms of order %/At and At respectively: 


e< 2 ) 


jAV (opt) = 0 

A\d opt ) ^AV(° pt ) 




DAf 


= 0, 


( 66 ) 


where stands for double contraction, so {H : VV T ). = ^2jk'HijkVjVk- 
and H as a Hessian tensor: 


T~imnp (X) 


d 2 Om (X) 
dX n 8X p ' 


Here, we define J as previously, 


(67) 


In the one-dimensional moment-based example, the results are simpler. Due to the absence of cross-derivative 
terms, the second derivative tensor becomes a matrix, 


H ran (X) = 


d 2 O m (X) TO - 1 

dXl - m N S 


X, 


m —2 


The variance condition then becomes: 


( 68 ) 


e (2) = (AV (opt) o AV (opt) - DA tj = 0, 

where D n = D, m . 


(69) 


5.7. Iterative solution to the individual equations 

These are nonlinear equations, and to solve them we implement an iterative approach. For a large 
ensemble, the optimizing equations are nearly satisfied to zero-th order, up to errors of order 1 /y/Ns- 
Hence, it is efficient to iteratively solve the equations by linearization, giving a Newton-Raphson approach 
similar to the combined method given above. 

Defining the difference = AV( i+1 ) — AV(j), the separate conditions are linearized by assuming 

that ||<5X (i+1) || < || AV|| , to give 


J (AV (i) + <5X (i+1) ) = 0, 

\u -. (AV (i) AVf i) + <5X (i+1) AV^+AV (i) 5Xf i+1) -DAt) = 0. (70) 

Noting that H is symmetric with respect to its last two indices (Hj nnp = Timpn ), we can rewrite the terms 
with <5Xp +1 ) in the second equation as 

l -n : (<5X (i+1) AV£) + AV (i) <5Xf i+1) ) = («AV (j) ) <5X (i+1) , (71) 

which allows us to concatenate the two systems into a single matrix equation 

j (i) aX (i+1) =R(AV (i) ). (72) 

Here the doubly extended matrix J is: 

J = («iv)- (73) 

and the remainder vector R (AV) is defined as: 

( -JAV \ 

R (AV) - ( \n : (DAt - AV W AV^) j ' (74) 
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Starting with i = 0, we now take AV^ and solve these equations iteratively to obtain AV( i+1 ) = 
<5X( i+1 ) + AV(j). To obtain a linear solution for each iterative step, we can use a pseudo-inverse as before, 
so that: 


AV (i+1) = AV (i) + jJ } R (AV (i) ) . (75) 

As previously, this reduces to ordinary matrix inversion in most cases of interest since, we can define a 
positive semi-definite matrix 


u (Av) = j (AV) j f (AV). (76) 

Provided this is invertible, this gives an iterative procedure for the individual POS algorithm, which is: 

AV( i+ i) = AV(j) + J(i)U(j)R(i)- (77) 

This method can be implemented to scale as O (MNg) by pre-calculating repeating matrix elements. 


6. Synthetic one-step benchmarks 

Before we apply our POS method to solving an SDE with error-propagation over many successive mini¬ 
mization steps, it is informative to see how effective it is at finding a solution for the corresponding single 
step equation. 

In order to evaluate the statistical performance of these two methods at single-step error optimization, 
we implement the combined and individual optimization methods with 10000 separate initial ensembles, 
each of 1000 parallel samples of X and initial stochastic noises W. In all the tests, the initial elements of 
X are normally distributed with the mean 1 and standard deviation 0.1, Gq = 0.5 for all i (constant drift), 
b i = 0.5 for all i (additive noise). 

Results for more complicated synthetic tests show similar behavior, so we do not include them here. 


6.1. Combined optimization 

In this section we are interested in how well we can solve the target equation (571. The error measure in 
this case is essentially the same as in Section |4~T) 


Rm (X) = 




(78) 


where the target moments c m are given by Eq. (591. 

Similarly to the previous section, we will scale the errors as 


R 


m 


R,i 


\/At7 Ns 


(79) 


in order to bring the errors for the unoptimized, randomly sampled vectors close to 1. 

The results in Fig. [5] show that a significant improvement is achieved for all 6 moments being optimized, 
with normalized sampling errors reduced to 10~ 12 , or around twelve orders of magnitude in these examples. 
Even the non-optimized moments of orders m = 7,8 have errors reduced to ICE 5 , or around five orders. 
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Figure 5: Distribution of normalized errors in the synthetic benchmark for the combined method with At = 10 4 , Ns = 1000. 
Blue solid lines show the distribution of Rm for the optimized vector, grey dotted lines show the distribution of Rm for the 
initial approximation X( 0 ) (t + At). 




Q 

03 

_Q 

O 

Q. 





lo gi0 (N s ) 


Figure 6: (a) number of iterations taken in the synthetic test for the combined method, At = 10 —4 , Ns = 1000, (b) Relative 
distance D between the optimized vector X( opt ) (t + At) and the initial approximation X( 0 ) (t + At), and (c) average relative 
distance ( D ) depending on the number of trajectories Ns in the synthetic optimization tests for the individual expansion 
method, At = 10 —4 . The grey dashed line denotes the slope of a 1 /y/Ns dependence. 
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Figure 7: (a) Average relative distance ( D ) depending on At in the synthetic optimization tests for the combined method, 
with Ns = 10 4 - 5 (dotted yellow line), Ns = 10 5 (dash-dotted green line), Ns = 10 5,5 (dashed red line) and Ns = 10 6 (solid 
black line). The grey dashed line denotes the least squares fit for the linear part of the Ns = 10 6 line and corresponds to 
the dependence (D) oc At 129 , (b) Average relative distance ( D ) depending on Ns in the synthetic optimization tests for the 
combined method, with At = 10 -2,25 (dotted yellow line), At = 10 -2,5 (dash-dotted green line), At = 10 -2,75 (dashed red 
line) and At = 10 -3 (solid black line). The grey dashed line denotes the least squares fit for the linear part of the At = 10 —3 
line and corresponds to the dependence ( D ) oc TV^ -0 ' 49 . 


6.2. Number of iterations and convergence distance 

The method requires only a few iterations to converge, as Fig. |6ja) demonstrates. 

Next, we calculate the distance of the variance-reduced solution from the original estimate. It is important 
that this quantity is as small as possible, consistent with the targeted variance reduction, to eliminate 
systematic errors in higher-order, non-minimized moments. We compute: 


||AX(° pt )-AX (0) | 


|X(°p‘) (t + Ai)-X (0) (i + Ai)| 


M _ "- - W 1 — M-- ■ , -~yu; y- ' / ii /oq\ 

II AX (0) || - IIX (0) (t + At) — X (t) | 1 J 

to estimate how far the solution is from the initial approximation. Fig. [6f b) shows that the deviation of the 
optimized noise from the original is 0.07 on average, which is of order l/\fW s as expected. 

The difference in the orders of At in the left and the right parts of the target equation (571 leads to the 
distance D being bounded by a finite value, depending on either At or 1/iVg. This is unlike the behavior 
demonstrated by the static optimization in the previous section. Our tests show that for very small At, the 
asymptotic dependence of (D) on Ns is (D) oc 1 /\/Ns, as shown in Fig. [Tp). Similarly, for very large Ns 
the dependence is (D) oc At 3 / 2 , as seen in Fig. [7[b). 

This asymptotic error is essentially due to the fact that the moment equations are themselves an expan¬ 
sion in At, leading to a residual truncation error. 

The tests of the dependence on the noise scale (At) show that the method is stable for a wide range 
of values, but breaks down at larger values of At as seen in Fig. [8] The dependence on the number of 
trajectories in Fig. [9] displays a similar cutoff as seen in the tests of the static POS, in the results of the 
previous subsection. 


6.3. Individual optimization 

In this subsection we will use the individual method from Section |5.6| for a similar type of synthetic 
benchmark. Since the target equations we try to satisfy are different from the previous section, the definition 
of the error changes to simply 


R m (X) = |e«| + |e, 


„( 2 )| 


where the vectors and e^ 2 ) are defined by Eq. (66). 


Similarly to the approach of the previous subsection, we normalize the error as 


Rm. 


R„ 


y/ At /Ns 
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Figure 8: Dependence of errors on the number of trajectories Ns in the synthetic test for the combined method, At = 10 —4 . 
Blue solid lines show the average of Rm for the optimized vector, grey dotted lines show the average of Rm for the initial 
approximation X( 0 ) (t + At). 
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Figure 9: Dependence of average scaled errors on At in the synthetic test for the combined method, Ns = 1000. Blue solid 
lines show the average of Rm for the optimized vector, grey dotted lines show the average of Rm for the initial approximation 


X(o) (t + At). 
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Figure 10: Distribution of normalized errors in the synthetic benchmark for the individual method with At = 0.1, Ng = 1000. 
Blue solid lines show the distribution of for the optimized vector, grey dotted lines show the distribution of for the 
initial effective noise term AV. 





Figure 11: (a) The number of iterations taken in the synthetic test for the individual method, At = 0.1, Ns = 1000. (b) 
Relative distance D between the optimized vector AV( opt ) and the initial approximation AV. (c) Average relative distance 
(D) depending on the number of trajectories Ns in the synthetic optimization tests for the individual method, At = 0.1. The 
grey dashed line denotes the slope of a 1 /y/Ns dependence. 


The results in Fig. |lQ^ a) show that a significant improvement is achieved for the first 6 moments being 
optimized, with results only limited by the numerical precision. To check that numerical precision was the 
limiting factor, tests were run with the same method in quadruple precision. This demonstrated that the 
improvement was indeed limited only by the numerical precision, with errors now as low as 10 -30 . The 
stopping condition must be changed to rj = 10“ 15 to reach this greater numerical precision limit. 

6-4- Number of iterations and convergence distance 
We also calculate the distance 


||AV(°pt)_AV|| 

|| A V|| (83) 

and the number of iterations in our tests. The method also requires only a few iterations to converge, as 
Fig. 11 'a) demonstrates. Fig. ©b) shows that the deviation of the optimized noise from the original one 
is small. Unlike the combined method, there is no mismatch be twe en the orders of At in Eq. ( 661 , and the 
distance therefore converges as ( D) oc 1/y/Ns , as shown in Fig. 
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c). 


The tests of the dependence on the noise scale (At) show that this method is exceptionally stable over a 
wide range of step-size values, as seen in Fig. [12] The dependence on the number of trajectories in Fig. [13] 
displays a similar cutoff to those seen in the tests of the static POS. 


22 






































optimized 


. 



< X 1 > 

< X 2 > 

< X 3 > 







< X 4 > 

< X 5 > 

< X 6 > 

- 




0 -1 -2 -3 -4 -5 -6 


lo gi 0 ( At ) 



Figure 12: Dependence of average scaled errors on At in the synthetic test for the individual method, Ns = 1000. Blue solid 
lines show the average of Rm for the optimized vector, grey dotted lines show the average of Rm for the initial approximation 
AV (0) . 
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Figure 13: Dependence of errors on the number of trajectories Ns in the synthetic test for the expansion method, At = 0.1. 
Blue solid lines show the average of Rm for the optimized vector, grey dotted lines show the average of Rm for the initial 
approximation AV( 0 )- 
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7. Dynamical numerical results 

Next, putting all the preliminary results together, tests were performed of complete stochastic differential 
equation solutions. Although all results demonstrate strong variance reduction, the amount and nature of 
variance reduction depends on the precise equation, as explained below. 

7.1. Linear drift, additive noise 

It is instructive to first consider an “ideal” linear SDE for the application of POS, with the target 
observables being the moments of x as previously. An ideal linear case is one with a linear a and b. This is 
preferred as a first test because the errors from non-optimized moments do not get mixed with the optimized 
ones, as explained in Section |5.3| In these benchmarks, both the combined and individual methods were 
used, giving very similar performance overall. 

As an example, we used a one-dimensional Ornstein-Uhlenbeck process 

d x = (f — gx) d t + bdw, (84) 

where (dw) = 0 and (dw 2 ) = dt, and /, g and b are constants. It has well-known exact solutions for every 
moment. Namely, for Gaussian starting conditions the mean is given by: 

(x) = - (l - e~ gt ) + e~ 9t (x) t = o, (85) 

9 

and for the variance, 

{{x - {x)) 2 ) = e~ 2gt {{x - {x)) 2 ) t = o + ^ (i - e~ 2gt ) , (86) 

Higher order central moments are then calculated according to 

({x- {x)) m ) = 0, m = 3,5,... (87) 

and: 

({x - {x)) m ) = m\\((x - (x)) 2 ) m/2 , m = 4,6,... (88) 

Consequently, the raw moments are calculated from the central ones using the recursive formula 

m— 1 

{x m ) = {(x-{x)) m )-Y, 

p—0 

We also use cumulants as benchmarks, since these are a commonly used alternative statistical measure 
to the ordinary moment. The cumulants are calculated from the moments recursively, as 

m ~ 1 / _ 1 \ 

= ( X™) - E ( 7 - 1 J K ^ Xm ~ P) - (90) 

In our tests we took / = 1, g = 0.2, b = 0.5. At initial times, ®(0) is normally distributed with mean 
0.5 and standard deviation 0.1. We pre-optimized the starting distributions using the static method from 
Section [374] in order to satisfy the requirements of POS methods. 


m 

P 


<■ x p )(-(x)Y 


(89) 
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Figure 14: Linear SDE case with M = 6 optimized moments, showing the difference of the first 8 moments from the exact 
solution at t = 1 for the POS integrator using the combined method (solid blue lines) and a reference explicit Euler integrator 
(dashed red lines), with identical numbers of trajectories and samples, averaged over 8 tests. Both integrators use Nt = 8 x 10 4 
time steps. 
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Figure 15: Linear SDE case with M = 6 optimized moments, showing the difference of the first 8 moments from the exact 
solution at t = 1 for the POS integrator using the individual method (solid blue lines) and a reference explicit Euler integrator 
(dashed red lines), with identical numbers of trajectories and samples, averaged over 8 tests. Both integrators use Nt = 8 x 10 4 
time steps. 
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Figure 16: Linear SDE case with M = 6 optimized moments, showing the difference of the first 8 cumulants from the exact 
solution at t = 1 for the POS integrator using the individual method (solid blue lines) and a reference explicit Euler integrator 
(dashed red lines), with identical numbers of trajectories and samples, averaged over 8 tests. Both integrators use Nt = 8 X 10 4 
time steps. 


7 .2. Cumulant and moment variance reduction 

For our tests, we compare the deviations — due to sampling errors — from the exact solution in the 
first 8 moments and first 8 cumulants of x at t = 1 for the POS integrator. We use both combined and 
individual methods, with a reference SDE solver having the same number of samples, for comparison. The 
latter uses the same integration method (explicit Euler) and the same number of trajectories as POS. We 
set up POS to optimize the first 6 moments, so that we could look at the behavior of both optimized and 
non-optimized moments. 

In Figs. [I4| and [T5] we plotted the error in the first moments 


E[{x m ))= (a: m ) - (a; m ) (exact) 


(91) 


for the reference and the two POS integrators, averaged over 8 independent tests. In case of the POS 
methods the error is dominated by the time step error ct, while the error for the reference integrator 
decreases proportional to 1 /y/Ng. One can see that the POS error is significantly decreased even for the 
moments that were not directly optimized. The error improvement is largest for the individual method, but 
is still several orders of magnitude even for the simpler combined method. For the remaining graphs we will 
mostly focus on the individual method which give better results, to minimize the number of graphs, as both 
have similar general behavior. 

The general behavior in the linear case can be understood better if we plot the errors in the cumulants 
E [ftp] instead, as in Fig. 16 It shows that the error in the first 6 cumulants is greatly reduced, while 
the non-optimized cumulants stay roughly the same. In Fig. |15| 7-th and 8-th moments, while not being 
optimized directly, still depend on the lower cumulants, and thus benefit from their reduced error. This 
behavior is caused by the fact that in a linear SDE, cumulants of different orders are not coupled with each 
other; however moments are coupled, leading to the observed error performance. 


7.3. Nonlinear drift, additive noise 

As an example of a real-world, nonlinear SDE we can apply the POS method to, we take an equation 
with a nonlinear drift term and additive noise: 


da; = {x — a; 3 ) df + dru, 


(92) 
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Figure 17: Nonlinear SDE case with M = 6 optimized moments, showing the difference of the first 8 moments, as well as the 
exponential and absolute value function from the steady-state solution at t = 25 for the POS integrator using the combined 
method (solid blue lines). We also include the errors in an explicit Euler integrator (solid red lines), with identical numbers of 
trajectories and samples, for comparison. Both integrators use Nj< = 12.5 X 10 3 time steps. The dash-dotted grey line shows 
the slope for oc N^ 1 ^ 2 . 


where (d w) = 0 and (dip 2 ) = 1. This is similar to a known case where moment hierarchy methods fail to 
give correct results [T5] , 

For large times t, Eq. (921 converges to a steady-state solution. For any given observable (/ (x)), we can 
find its steady-state value using the formula 


(/(*))« = 


I-oo / (*) e ‘ 


,X — -kX 


dx 


rOO 

J —C 


dx 


(93) 


We initialize our ensemble with normally-distributed numbers with mean 0 and standard deviation 
and integrate Eq. (921 using a combined POS-method as well as an unoptimized Ito-Euler method until 


t = 25 and then compare with the steady-state values. The POS-optimized integration includes an initial 


static optimization. Integrating until t = 25 ensures that Eq. (92) predicts observables sufficiently close to the 


steady-state values. We assume that after some integration time, the time evolution of the observables will 
be dominated by a transient which is exponentially converging to the steady-state solution. A least-squares 
fit of the time-evolution of the 8’th moment is in excellent agreement with the time-evolution resulting from 
our simulations. From the fitting parameters, we can read off a decay of more that 50 mean lifetimes at 
t = 25. We conclude that the deviation from the steady-state value due to finite integration time is negligible 
compared to sampling errors. As in the previous section, we plot the first 8 moments and first 8 cumulants 
of x at t = 25, but we optimize only the first 6 moments when the POS is used. In order to show the 
behaviour of non-polynomial observables, we also plot the exponential and absolute value of x. 

In Fig. IT?] we are plotting differences from the accurate solution for the first 8 moments: 


E[{x m )} = \{x m )-(x m ) a 


(94) 


as well as for the exponential and absolute value function defined analogously to Eq. (94) averaged over 120 


independent test runs for the POS integration and 600 runs for the reference integration. 

The POS integrator consistently outperforms the reference integrator, with errors reduced 
by a factor of 10 — 100. 

A sampling error reduction of this order of magnitude requires 10 2 — 10 4 times more samples using 
standard integrators. 
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Figure 18: Nonlinear SDE case, showing the difference of the first 8 cumulants from the steady-state values at t = 25 for the 
POS integrator using the combined method (solid blue lines) and a reference explicit Euler integrator (dashed red lines). Both 
integrators use Nt = 12.5 X 10 3 time steps. The dash-dotted grey lines shows the slope for oc N s 1 ^ 2 . 


In Fig. [l8]we show results for the cumulants in the nonlinear case, for completeness. Unlike the linear 
case, where the improvement was confined to only the explicitly optimized cumulants, we see that in the 
nonlinear case even the non-optimized cumulants are optimized as well. This is because the nonlinear 
equations couple all cumulants with each other. The results show that POS results in dependency slightly 
better than N s ' " for the even cumulants. This is an interesting and noteworthy results though we don’t 
have an explanation for it yet. 


In a separate analysis, we have integrated Eq. (921 until t = 4 with a time-step of At = 4 x 10 4 using 


both the combined and the individual POS-method as well as an explicit Ito-Euler reference method. We 
have compared the values of the first 8 consecutive moments and cumulants with that of a regular SDE 
integrator with Nt = 8 x 10 4 time steps and Ns = 10 9 trajectories, using a central difference integration 
method m- The large number of trajectories for the regular SDE integrator ensured that we obtained a 
sufficiently accurate estimate for the true value of the moments and cumulants at t = 4. The comparison 
showed that individual POS method performed very similarly to the combined POS method, resulting in a 
similarly good accuracy improvement. In the interest of space, we are limiting our results to the combined 
POS method from now on. 


7-4■ Irregular drift, additive noise 

We now investigate the properties of an SDE whose functional behaviour is not a regular polynomial. 
We chose the SDE 


dx = x (1 — \x\) dt + Aw, 

where (dm) = 0 and (Aw 2 ) = 1. 

We can obtain the steady-state solutions for a given observable (/ (a;)) via 


(/(*)>„ = 


ST OQ f(x)e x2 -l\ x ^Ax 
f°° e x2 -fH x2 da; 

J — OO 


(95) 


(96) 


We run a simulation identical to the one in Section 7.3 where we replace the SDE by Eq. (95). An 


analysis analogous to that in Section |7.3| reveals an equally good convergence to the steady-state values at 


t = 25. The results given in Fig. 19 show that the overall performance of POS for the non-regular SDE is very 


similar to that for the nonlinear SDE studied in Section 7.3 Despite the fact that Eq. (951 is non-regular, 


we observe virtually the same rate of improvement through POS as we did for the regular SDE. 
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Figure 19: Nonlinear SDE case, showing the difference of the first 8 moments, the exponential and absolute value function 
from the steady-state solution at t = 25 for the POS integrator using the combined method (solid blue lines) and a reference 
explicit Euler integrator (solid red lines). Both integrators use Nt = 12.5 x 10 3 time steps. The dash-dotted grey line shows 
_ 1/2 

the slope for oc N s 
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Figure 20: Nonlinear SDE case, showing the difference of the first 8 cumulants from the accurate solution at t = 25 for the 
POS integrator using the individual method (solid blue lines) and a reference explicit Euler integrator (solid red lines). Both 

q — 1/2 

integrators use Nt = 12.5 X lO' 3 time steps. The dash-dotted grey line shows the slope for oc N s 
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8. Two-dimensional case 


Finally, to demonstrate that POS can be applied to SDE’s with dimension higher than 1, we briefly 
consider the so-called laser equation. This is a simplified model for the single-mode quantum statistics in 
a laser system. Although it can be written as a single complex equation, it is an example of an SDE with 
dimension higher than one in real variables. We consider 


da = 


(i - M 2 ) 


a dt + b dW c , 


(97) 


where both a and d,W c are complex-valued. The noise obeys (dW c dW*) = 2 dt. Here a is proportional to 
the mode amplitude, and b is a real-valued parameter that depends on the steady-state photon number of 


the laser mode ESI- By treating the real and imaginary parts of a as separate variables, Eq. ( |97| constitutes 
a two-dimensional real-valued SDE. 


The scaled photon number n is given by n = |a 
n to be 

^2 
n 


= 1 


A simple calculation reveals the steady-state value for 


(98) 


7T exp [1/ {2b 2 )] [l + erf (l/ (a/2 b))] 

Using Ns = 131072 trajectories for the real and imaginary part 5ft {a} and 3 {a}, resulting in Ns = 262144 
trajectories in total, we initialize the ensemble with normally-distributed numbers with mean 0 and standard 
deviation for 5ft {a} and 3 {a} and carry out a POS-optimized integration using the combined method 
as well as an explicit Ito-Euler reference integration with an integration time of t = 25. An analysis similar 


to that in Sections |7.3| and 7.4 reveals an equally good convergence to the steady-state at t = 25. We then 


compare the value of n = \a\ at t = 25 with the steady-state value. 

8.1. Error in the equilibrium photon number 

We are optimizing (constraining) every moment and cross-moment for the real and imaginary part of a 
up to order 4. In other words, we are optimizing the first 4 consecutive moments of 5ft {a}, of 3 {a} and 
cross-moments of the form 5ft {a} n 3 {a} m , with a combined power n+m not greater than 4. This results in a 
total of 18 constrained moments. The POS optimization includes the optimization of the initial trajectories. 

We vary the parameter b over a range from 1 ■ 10 -2 to 10.24 in 11 steps. As in Sections 7.3 


and 


7.4 


we 

average over 120 independent test runs for the POS integration and 600 runs for the reference integration. 

As a typical physical example of a scientifically important quantity in a stochastic calculation, Fig. [2l] 
shows the error in calculating n s , the equilibrium photon number in the laser cavity. Here we use the 
combined Ito POS method (blue line), and a non-optimized stochastic integration (red line) for the laser 
equation for different values of b. 

This example shows that POS can be successfully used to improve the accuracy of complex-valued SDEs 
and, more generally speaking, real-valued multivariate SDEs by several orders of magnitude. 


9. Conclusions 

In summary, we have proposed and implemented a novel variance reduction technique for solving stochas¬ 
tic differential equations, using parallel optimized sampling. The essential feature is that it unifies moment 
hierarchy and independent stochastic methods. 

A finite moment hierarchy condition is imposed as a nonlinear constraint on the random noises generated 
at each step in the integration. This gives a dramatic reduction in sampling error for all moments calculated. 
In the case of linear equations, we find that the low order moments have sampling errors reduced to machine 
accuracy. While higher order moments also have their errors reduced, the higher order cumulants are not 
affected. 

For nonlinear equations, the error reduction is not as large, but it occurs over all moments and cumulants 
studied, even including the non-optimized cumulants of higher order than the optimization limit. We 
emphasize that the proposed algorithms have a general applicability to all types of stochastic equations, and 
can be extended in principle to higher order methods as well as the simple Euler integration treated here. 
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Figure 21: Accuracy of the scaled photon number n s for the laser equation using a reference explicit Euler integrator (red line) 
and a POS-optimized integration (blue line) as a function of the parameter b. Both integrators use Nt = 12.5 X 10 3 time steps. 
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